19 research outputs found
Towards a discipline of performance engineering : lessons learned from stencil kernel benchmarks
High performance computing systems are characterized by a high level of complexity both on their hardware and software side. The hardware has evolved offering a lot of compute power, at the cost of an increasing effort needed to program the systems, whose software stack can be correctly managed only by means of ad-hoc tools.
Reproducibility has always been one of the cornerstones of science, but it is highly challenged by the complex ecosystem of software packages that run on HPC platforms, and also by some malpractices in the description of the configurations adopted in the experiments.
In this work, we first characterize the factors that affect the reproducibility of experiments in the field of high performance computing, and then we define a taxonomy of the experiments and the levels of reproducibility that can be achieved, following the guidelines of a framework that is presented.
A tool that implements said framework is described and used to conduct Performance Engineering experiments on kernels containing the stencil (structured grids) computational pattern. Due to the trends in architectural complexity of the new compute systems and the complexity of the software that runs on them, the gap between expected and achieved performance is widening. Performance engineering is critical to address such a gap, with its cycle of prediction, reproducible measurement, and optimization.
A selection of stencil kernels is first modeled and their performance predicted through a grey box analysis, and the predictions are then compared against the reproducible measurements. The prediction is used to validate the measured performance and vice versa, resulting in a "Gold Standard" that draws a path towards a discipline of performance engineering.
Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale
The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian
method, used in numerical simulations of fluids in astrophysics and
computational fluid dynamics, among many other fields. SPH simulations with
detailed physics represent computationally-demanding calculations. The
parallelization of SPH codes is not trivial due to the absence of a structured
grid. Additionally, the performance of the SPH codes can be, in general,
adversely impacted by several factors, such as multiple time-stepping,
long-range interactions, and/or boundary conditions. This work presents
insights into the current performance and functionalities of three SPH codes:
SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an
interdisciplinary co-design project, SPH-EXA, for the development of an
Exascale-ready SPH mini-app. To gain such insights, a rotating square patch
test was implemented as a common test simulation for the three SPH codes and
analyzed on two modern HPC systems. Furthermore, to stress the differences with
the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an
additional test case, the Evrard collapse, has also been carried out. This work
extrapolates the common basic SPH features in the three codes for the purpose
of consolidating them into a pure-SPH, Exascale-ready, optimized, mini-app.
Moreover, the outcome of this serves as direct feedback to the parent codes, to
improve their performance and overall scalability.
Comment: 18 pages, 4 figures, 5 tables, 2018 IEEE International Conference on
Cluster Computing proceedings for WRAp1
SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App
Numerical simulations of fluids in astrophysics and computational fluid
dynamics (CFD) are among the most computationally-demanding calculations, in
terms of sustained floating-point operations per second, or FLOP/s. It is
expected that these numerical simulations will significantly benefit from the
future Exascale computing infrastructures, that will perform 10^18 FLOP/s. The
performance of the SPH codes is, in general, adversely impacted by several
factors, such as multiple time-stepping, long-range interactions, and/or
boundary conditions. In this work, an extensive study of three SPH
implementations, SPHYNX, ChaNGa, and XXX, is performed to gain insights and to
expose any limitations and characteristics of the codes. These codes are the
starting point of an interdisciplinary co-design project, SPH-EXA, for the
development of an Exascale-ready SPH mini-app. We implemented a rotating square
patch as a joint test simulation for the three SPH codes and analyzed their
performance on a modern HPC system, Piz Daint. The performance profiling and
scalability analysis conducted on the three parent codes exposed
their performance issues, such as load imbalance, both in MPI and OpenMP.
Two-level load balancing has been successfully applied to SPHYNX to overcome
its load imbalance. The performance analysis shapes and drives the design of
the SPH-EXA mini-app towards the use of efficient parallelization methods,
fault-tolerance mechanisms, and load balancing approaches.
Comment: arXiv admin note: substantial text overlap with arXiv:1809.0801
Reducing the environmental impact of surgery on a global scale: systematic review and co-prioritization with healthcare workers in 132 countries
Abstract
Background
Healthcare cannot achieve net-zero carbon without addressing operating theatres. The aim of this study was to prioritize feasible interventions to reduce the environmental impact of operating theatres.
Methods
This study adopted a four-phase Delphi consensus co-prioritization methodology. In phase 1, a systematic review of published interventions and global consultation of perioperative healthcare professionals were used to longlist interventions. In phase 2, iterative thematic analysis consolidated comparable interventions into a shortlist. In phase 3, the shortlist was co-prioritized based on patient and clinician views on acceptability, feasibility, and safety. In phase 4, ranked lists of interventions were presented by their relevance to high-income countries and low–middle-income countries.
Results
In phase 1, 43 interventions were identified, which had low uptake in practice according to 3042 professionals globally. In phase 2, a shortlist of 15 intervention domains was generated. In phase 3, interventions were deemed acceptable for more than 90 per cent of patients except for reducing general anaesthesia (84 per cent) and re-sterilization of ‘single-use’ consumables (86 per cent). In phase 4, the top three shortlisted interventions for high-income countries were: introducing recycling; reducing use of anaesthetic gases; and appropriate clinical waste processing. In phase 4, the top three shortlisted interventions for low–middle-income countries were: introducing reusable surgical devices; reducing use of consumables; and reducing the use of general anaesthesia.
Conclusion
This is a step toward environmentally sustainable operating environments with actionable interventions applicable to both high- and low–middle-income countries.
Trusted high-performance computing in the classroom
A well-designed high-performance computing (HPC) course not only presents theoretical parallelism concepts but also includes practical work on parallel systems. Today's machine models are diverse and, as a consequence, multiple programming models exist. The challenge for HPC course lecturers is to decide what to include and what to exclude. We have experience in teaching HPC in a multi-paradigm style. The practical course parts include message-passing programming using MPI, directive-based shared memory programming using OpenMP, partitioned global address space based programming using Chapel, and domain-specific programming using a high-level framework. If these models are taught in isolation, students have difficulty assessing the strengths and weaknesses of the approaches presented. We propose a project-based approach which introduces a specific problem to be solved (in our case a stencil computation) and asks for solutions using the programming approaches introduced. Our course has been successfully taught several times, but a major problem has always been checking the individual student solutions, especially deciding which reported performance results can be trusted. To overcome these deficiencies, we have built a pedagogical tool which enhances the trust in students' work. In the paper we present the infrastructure and tools that make student experiments easily reproducible by lecturers. We introduce a taxonomy for general benchmark experiments, describe the distributed architecture of our development and analysis environment, and, as a case study, discuss performance experiments when solving a stencil problem in multiple programming models.
Reproducible experiments in parallel computing: concepts and stencil compiler benchmark study
For decades, the majority of the experiments on parallel computers have been reported at conferences and in journals, usually without the possibility to verify the results presented. Thus, one of the major principles of science, reproducible results as a kind of correctness proof, has been neglected in the field of experimental high-performance computing. While this is still the state of the art, current research is targeting solutions to this problem. We discuss early results regarding reproducibility from a benchmark case study we conducted. In our experiments we explore the class of stencil calculations that are part of many scientific kernels and compare the performance results of four stencil compilers. In order to make these experiments reproducible remotely, a first prototype of a replication engine has been developed that can be accessed via the internet.
Performance output data and configurations of stencil compilers experiments run through PROVA!
The data in this article are related to the research article titled “Reproducible Stencil Compiler Benchmarks Using PROVA!”. Stencil kernels have been implemented using a naïve OpenMP (OpenMP Architecture Review Board, 2016) [1] parallelization and then using the stencil compilers PATUS (Christen et al., 2011) [2] and PLUTO (Bondhugula et al., 2008) [3]. Performance experiments have been run on different architectures using PROVA! (Guerrera et al., 2017) [4], a distributed workflow and system management tool to conduct reproducible research in computational sciences. Information such as the compiler version, compilation flags, configurations, experiment parameters, and raw results is fundamental contextual information for the reproducibility of an experiment. All this information is automatically stored by PROVA! and, for the experiments presented in this paper, is available at https://github.com/sguera/FGCS17
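The kind of contextual information listed above can be pictured as a small serializable record. The following is only a minimal sketch: it is not PROVA!'s actual storage format, and every class name, field name, and value in it is a hypothetical placeholder chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExperimentRecord:
    """Hypothetical record of the context needed to rerun one experiment."""
    compiler: str
    compiler_version: str
    compilation_flags: list
    parameters: dict
    raw_results: list = field(default_factory=list)

    def to_json(self) -> str:
        # Sort keys so that two identical records serialize identically.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# All values below are illustrative placeholders, not measured data.
record = ExperimentRecord(
    compiler="gcc",
    compiler_version="7.3.0",
    compilation_flags=["-O3", "-fopenmp"],
    parameters={"grid_size": 512, "threads": 8},
    raw_results=[1.92, 1.95, 1.90],
)

# Round-trip through JSON to show the record can be stored and restored.
restored = ExperimentRecord(**json.loads(record.to_json()))
```

Storing such a record next to the raw results means a later reader can see which compiler, flags, and parameters produced a given number without relying on the prose description of the experiment.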
Reproducible Benchmarking of Parallel Stencil Codes
Current performance reporting in the High Performance Computing field omits details that are needed to test and reuse the results, undermining what has been considered a pillar of science since the 17th century: the scientific method. Every scientist must be able to understand and extend the work of another. Modern architectures are extremely complex and scientists often focus their efforts on what they are pursuing, ignoring the importance of making their science reproducible. We acknowledge the lack of time and the effort necessary to follow good practices in conducting research in computational science, yet consider reproducibility to be extremely important. For this reason, we first designed and implemented a software framework, prova!, to help scientists in their research by allowing them to focus on their core business while it takes care of reproducibility, and then used it in our own research on benchmarking of parallel stencil codes. Stencil codes are an important and widely used pattern in computational science, but, due to their low arithmetic intensity, it is hard to achieve good performance. For this reason several approaches to stencil computation have been proposed and several stencil compilers implemented. Starting from the problem of stencil computation, we dive into stencil compilers. How to compare them? Many factors can affect the outcome of a comparison, such as the stencil itself (what if we change the stencil computation?) and the architecture (does compiler A always outperform compiler B, even on a different architecture?). How to evaluate them while changing both the stencil to compute and the architecture? Using prova! we can provide a comparison of the same stencil on different architectures and of different stencils on either the same or different machines. Benchmarking performance models are needed to identify the relevant parameters. 
We plan to interpret the performance data by means of models such as the Execution-Cache-Memory Performance Model and evaluate how performance is affected by explicitly pinning the threads, following different pinning strategies. The goal is to understand how compilers behave in relation to the architecture chosen and to state a subset of parameters to take into account, in order to predict the final performance. Furthermore, we propose a standardized way of describing an experiment, extrapolating a minimum amount of information needed for reproducibility purposes. We try to address the questions: “Are the results of our research reproducible? Can other scientists trust our conclusions?”. For this reason we conduct our research using prova!, a framework which will strengthen and give credibility to our results: through it we make available source code, dependencies, environment, build process, running and post-processing scripts, and the raw data used for generating the graphs. Conducting research this way has several positive effects: complete documentation is retained, follow-up studies can build upon existing work, and, in case of discrepancies during a replication, it helps in identifying and addressing the root of the problem.
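The low arithmetic intensity of stencil codes mentioned in this abstract can be illustrated with a minimal sketch. The kernel below is a generic 2D 5-point Jacobi sweep chosen as an example, not one of the kernels benchmarked in the work. The byte count is an assumption: it supposes double-precision values and ideal caching, so that only one 8-byte load and one 8-byte store per grid point reach main memory, and it ignores write-allocate traffic.

```python
def jacobi_5pt(grid):
    """One sweep of a 2D 5-point Jacobi stencil; boundary cells are copied unchanged."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # 3 additions + 1 multiplication = 4 FLOPs per interior point
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


FLOPS_PER_POINT = 4
# Assumption: ideal caching, one 8-byte load and one 8-byte store per point.
BYTES_PER_POINT = 2 * 8


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of main-memory traffic."""
    return flops / bytes_moved


# Tiny 4x4 grid with a single non-zero interior value.
grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 4.0
result = jacobi_5pt(grid)
intensity = arithmetic_intensity(FLOPS_PER_POINT, BYTES_PER_POINT)
```

Under these assumptions the kernel performs 0.25 FLOP per byte, so its speed is bounded by memory bandwidth rather than by the floating-point units, which is why pinning strategy and cache behaviour matter when comparing stencil compilers.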